13. Recovering From Failure


The key to recovering from failure is to understand how the failure occurred. Once you have this understanding, you can confirm that you've fixed the root cause rather than a symptom, and you will know how to prevent a recurrence. Finding a root cause can be straightforward if there is a direct cause and effect (we changed A, and B immediately happened). Other issues are harder to pin down, and some can only be identified by asking "what changed?".

CloudTrail is a great tool for determining what changed. It records the API calls made with any AWS credentials associated with your account, so you can audit and review changes and commands after the fact. Once you've discovered what was changed and who or what changed it, you can resolve the issue and take steps to ensure that the incident is not repeated.
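
As an illustration of asking "what changed?", the following is a minimal sketch of querying CloudTrail's event history with boto3. It assumes boto3 is installed and that credentials and a region are already configured. The function name `find_recent_changes` and the 24-hour window are hypothetical choices for this example, not part of the lesson. The sketch filters on the `ReadOnly` lookup attribute so that only write-type events (the ones that change something) are returned.

```python
from datetime import datetime, timedelta, timezone


def find_recent_changes(client, hours=24):
    """Return non-read-only CloudTrail events from the last `hours` hours.

    `client` is expected to be a boto3 CloudTrail client. The function
    name and the shape of the returned dicts are illustrative choices.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)

    # lookup_events returns results in pages, so use a paginator.
    paginator = client.get_paginator("lookup_events")
    pages = paginator.paginate(
        # ReadOnly=false keeps only events that modified something.
        LookupAttributes=[{"AttributeKey": "ReadOnly", "AttributeValue": "false"}],
        StartTime=start,
        EndTime=end,
    )

    changes = []
    for page in pages:
        for event in page.get("Events", []):
            changes.append({
                "time": event["EventTime"],
                "user": event.get("Username", "unknown"),
                "source": event.get("EventSource"),
                "action": event.get("EventName"),
            })

    # Oldest first, so the sequence of changes reads as a timeline.
    changes.sort(key=lambda change: change["time"])
    return changes


if __name__ == "__main__":
    import boto3  # third-party; assumes credentials and region are configured

    for change in find_recent_changes(boto3.client("cloudtrail"), hours=24):
        print(change["time"], change["user"], change["source"], change["action"])
```

Reading the output as a timeline shows who or what made each change and when, which can then be compared against the time the failure began. Note that this event history lookup only covers management events from roughly the last 90 days; older or more detailed records would need a trail delivering logs to S3.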